Abstract


We propose a novel convolutional architecture, named genCNN, for word sequence prediction.
Unlike previous work on neural network-based language modeling and generation (e.g., RNN or
LSTM), we choose not to greedily summarize the history of words as a fixed-length vector.
Instead, we use a convolutional neural network to predict the next word from a history of
variable length. Also unlike existing feedforward networks for language modeling, our model can
effectively fuse the local and global correlations in the word sequence, with a
convolution-gating strategy specifically designed for the task. We argue that our model gives an
adequate representation of the history, and can therefore naturally exploit both short- and
long-range dependencies. Our model is fast, easy to train, and readily parallelized. Our
extensive experiments on text generation and n-best re-ranking in machine translation show that
genCNN outperforms the state of the art by large margins.

1 Introduction 


Both language modeling (Wu and Khudanpur, 2003; Mikolov et al., 2010; Bengio et al., 2003) and text
generation (Axelrod et al., 2011) boil down to modeling the conditional probability of a word given
the preceding words. Previously, this was mostly done through purely memory-based approaches, such
as n-grams, which cannot deal with long sequences and have to use heuristics (called smoothing)
for rare ones. Another family of methods is based on distributed representations of words, usually
tied to a neural-network (NN) architecture for estimating the conditional probabilities of
words.

Two categories of neural networks have been used for language modeling: 1) recurrent neural
networks (RNN), and 2) feedforward networks (FFN):

• The RNN-based models, including variants like LSTM, enjoy more popularity, mainly due
to their flexible structures for processing word sequences of arbitrary lengths, and their recent
empirical success (Sutskever et al., 2014; Graves, 2013). We however argue that RNNs, with
their power built on the recursive use of relatively simple computation units, are forced to make
a greedy summarization of the history and are consequently not efficient at modeling word sequences,
which clearly have a bottom-up structure.


• The FFN-based models, on the other hand, avoid this difficulty by directly taking the history as
input. However, FFNs are fully-connected networks, rendering them inefficient at capturing
local structures of languages. Moreover, their “rigid” architectures make it futile to handle the
great variety of patterns in long range correlations of words.

We propose a novel convolutional neural network architecture, named genCNN, for efficiently
combining local and long range structures of language with the purpose of modeling conditional
probabilities. genCNN can be directly used in generating a word sequence (i.e., text generation) or evaluating











prediction: “sandwich”?

history: / / I was starving after this long meeting, so I rushed to wal-mart to buy a

Figure 1: The overall diagram of a genCNN. Here “/” stands for a zero padding. In this example,
each CNN component covers 6 words, while in practice the coverage is 30-40 words.


the likelihood of a word sequence (i.e., language modeling). We also show the empirical superiority 
of genCNN on both tasks over traditional n-grams and its RNN and FFN counterparts. 

Notations: We use $\mathcal{V}$ to denote the vocabulary, $e_t$ ($\in \{1, \cdots, |\mathcal{V}|\}$) to denote the $t$-th word in a
sequence $e_{1:T} \overset{\text{def}}{=} [e_1, \cdots, e_T]$, and $e_t^{(n)}$ if the sequence itself is further indexed by $n$.

2 Overview 

As shown in Figure 1, genCNN is overall recursive, consisting of CNN-based processing units of two
types:

• αCNN as the “front-end”, dealing with the history that is closest to the prediction;

• βCNNs (which can repeat), in charge of more “ancient” history.

Together, genCNN takes history $e_{1:t}$ of arbitrary length to predict the next word $e_{t+1}$ with probability

$$p(e_{t+1} \mid e_{1:t}; \Theta), \tag{1}$$

based on a representation $\phi(e_{1:t})$ produced by the CNN, and a $|\mathcal{V}|$-class soft-max:

$$p(e_{t+1} \mid e_{1:t}; \Theta) \propto \exp\left(\mu_{e_{t+1}}^{\top} \phi(e_{1:t}) + b_{e_{t+1}}\right). \tag{2}$$
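The soft-max of Equation (2) can be illustrated with a minimal sketch. It assumes the history representation is already available as a plain list of floats; the names `phi`, `mu`, and `b` and the toy numbers are illustrative assumptions, not values from the paper.

```python
import math

def next_word_distribution(phi, mu, b):
    """Return p(next word = w | history) for every word w, in the form of Equation (2).

    phi: history representation (list of floats), assumed to come from the CNN.
    mu:  one weight vector per vocabulary word.
    b:   one bias per vocabulary word.
    """
    # Score of word w is the inner product mu[w] . phi plus the bias b[w].
    scores = [sum(m * p for m, p in zip(mu_w, phi)) + b_w
              for mu_w, b_w in zip(mu, b)]
    # Subtract the largest score before exponentiating, for numerical stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy setting (hypothetical): 3-word vocabulary, 2-dimensional representation.
phi = [0.5, -1.0]
mu = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.1]
probs = next_word_distribution(phi, mu, b)
```

The normalization runs over the whole vocabulary, which is why Section 4.1 replaces it with a cheaper approximation.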


genCNN is fully tailored for modeling the sequential structure in natural language, notably different
from conventional CNNs (Lawrence et al., 1997; Hu et al., 2014) in 1) its specifically designed
weight-sharing strategy (in αCNN), 2) its gating design, and 3) its recursive architecture. Also
distinct from RNNs, genCNN gains most of its processing power from the heavy-duty processing units
(i.e., αCNN and βCNNs), which follow a bottom-up information flow and yet can adequately capture
the temporal structure in word sequences with their convolutional-gating architecture.


3 genCNN: Architecture

We start with discussing the convolutional architecture of αCNN as a stand-alone sentence model, and
then proceed to the recursive structure. After that we give a comparative analysis on the mechanism
of genCNN.

αCNN, just like a normal CNN, has a fixed architecture with a predefined maximum number of words
(denoted as $L_\alpha$). A history shorter than $L_\alpha$ will be filled with zero paddings, and a history longer
than that will be folded and fed to the βCNN, as will be elaborated in Section 3.3. Similar to most
other CNNs, αCNN alternates between convolution layers and pooling layers, and finally a fully
connected layer to reach the representation before soft-max, as illustrated by Figure 2. Unlike the
toyish example in Figure 2,















in practice we use a larger and deeper αCNN with $L_\alpha$ = 30 or 40, and two or three convolution layers
(see Section 4.1).

In Section 3.1 we will introduce the hybrid design of convolution in genCNN for capturing
structures of different nature in word sequence prediction. In Section 3.2 we will discuss the
design of the gating mechanism.


Figure 2: Illustration of a 3-layer αCNN (output: probability of next word). Here the unfilled nodes
stand for the Time-Flow feature-maps, and the filled nodes for Time-Arrow.


3.1 αCNN: Convolution


Different from conventional CNNs, the weights of the convolution units in αCNN are only partially
shared. More specifically, in the convolution units there are two types of feature-maps: Time-Flow
and Time-Arrow, illustrated respectively with the unfilled nodes and filled nodes in Figure 2. The
parameters for Time-Flow are shared among different convolution units, while for Time-Arrow the
parameters are location-dependent. Intuitively, Time-Flow acts more like a conventional CNN (e.g.,
that in (Hu et al., 2014)), aiming to understand the overall temporal structure in the word sequences;
Time-Arrow, on the other hand, works more like a traditional NN-based language model (Vaswani
et al., 2013; Bengio et al., 2003): with its location-dependent parameters, it focuses on capturing the
prediction task and the direction of time.




















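For sentence input $\mathbf{x} = \{x_1, \cdots, x_T\}$, the feature-map of type-$f$ on Layer-$\ell$ is

if $f \in$ Time-Flow:

$$z_i^{(\ell,f)}(\mathbf{x}) = \sigma\left(\mathbf{w}_{TF}^{(\ell,f)} \hat{\mathbf{z}}_i^{(\ell-1)} + b_{TF}^{(\ell,f)}\right), \tag{3}$$

if $f \in$ Time-Arrow:

$$z_i^{(\ell,f)}(\mathbf{x}) = \sigma\left(\mathbf{w}_{TA}^{(\ell,f,i)} \hat{\mathbf{z}}_i^{(\ell-1)} + b_{TA}^{(\ell,f,i)}\right), \tag{4}$$

where

• $z_i^{(\ell,f)}(\mathbf{x})$ gives the output of the feature-map of type-$f$ for location $i$ in Layer-$\ell$;

• $\sigma(\cdot)$ is the activation function, e.g., Sigmoid or ReLU (Dahl et al., 2013);

• $(\mathbf{w}_{TF}^{(\ell,f)}, b_{TF}^{(\ell,f)})$ denotes the location-independent parameters for $f \in$ Time-Flow on Layer-$\ell$,
while $(\mathbf{w}_{TA}^{(\ell,f,i)}, b_{TA}^{(\ell,f,i)})$ stands for those for $f \in$ Time-Arrow and location $i$ on Layer-$\ell$;

• $\hat{\mathbf{z}}_i^{(\ell-1)}$ denotes the segment of Layer-$(\ell-1)$ for the convolution at location $i$, while
$\hat{\mathbf{z}}_i^{(0)} \overset{\text{def}}{=} [x_i^{\top}, x_{i+1}^{\top}, \cdots, x_{i+k_1-1}^{\top}]^{\top}$
concatenates the vectors for $k_1$ words from sentence input $\mathbf{x}$.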
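The difference between the two feature-map types comes down to whether the convolution weights are indexed by location. The following is a minimal sketch for a single feature-map, assuming a sigmoid activation and plain Python lists; the function names, window size, and toy vectors are illustrative assumptions rather than the authors' implementation.

```python
import math

def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))

def segments(prev_layer, k):
    """Concatenate k consecutive vectors of the previous layer at each location."""
    return [sum(prev_layer[i:i + k], []) for i in range(len(prev_layer) - k + 1)]

def time_flow(prev_layer, k, w, b):
    """One Time-Flow feature-map: the same (w, b) is reused at every location."""
    return [sigmoid(sum(wi * zi for wi, zi in zip(w, seg)) + b)
            for seg in segments(prev_layer, k)]

def time_arrow(prev_layer, k, ws, bs):
    """One Time-Arrow feature-map: location i has its own (ws[i], bs[i])."""
    return [sigmoid(sum(wi * zi for wi, zi in zip(ws[i], seg)) + bs[i])
            for i, seg in enumerate(segments(prev_layer, k))]

# Toy input (hypothetical): 4 words with 2-dimensional embeddings, window k = 2.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
w = [0.1, 0.2, 0.3, 0.4]
flow = time_flow(words, 2, w, 0.0)
arrow = time_arrow(words, 2, [[0.1] * 4, [0.2] * 4, [0.3] * 4], [0.0, 0.0, 0.0])
```

If every location of a Time-Arrow map is given identical weights, it computes exactly what a Time-Flow map does, so Time-Flow can be read as the weight-tied special case.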

3.2 Gating Network 


Previous CNNs, including those for NLP tasks (Hu et al., 2014; Kalchbrenner et al., 2014), take
a straightforward convolution-pooling strategy, in which the “fusion” decisions (e.g., selecting the
largest one in max-pooling) are made based on the values of feature-maps. This is essentially a soft
template matching (Lawrence et al., 1997), which works for tasks like classification, but is undesirable
for maintaining the composition functionality of convolution. In this paper, we propose to use separate
gating networks to release the scoring duty from the convolution, and let it focus on composition.
A similar idea has been proposed by Socher et al. (2011) for recursive neural networks on parsing
tasks, but it has never been combined with a convolutional architecture.



Figure 3: Illustration for gating network (layers shown, top to bottom: Layer-$(\ell+1)$, Layer-$\ell$, Layer-$(\ell-1)$).

Suppose we have convolution feature-maps on Layer-$\ell$ and gating (with window size = 2) on
Layer-$(\ell+1)$. For the $j$-th gating window $(2j-1, 2j)$, we merge $\hat{\mathbf{z}}_{2j-1}^{(\ell)}$ and $\hat{\mathbf{z}}_{2j}^{(\ell)}$ as the input (denoted as
$\bar{\mathbf{z}}_j^{(\ell)}$) for the gating network, as illustrated in Figure 3. We use a separate gate for each feature-map, but
follow a different parametrization strategy for Time-Flow and Time-Arrow. With window size = 2,
the gating is binary, and we use a logistic regressor to determine the weights of the two candidates. For
$f \in$ Time-Arrow, with location-dependent $\mathbf{w}_{gate}^{(\ell,f,j)}$, the normalized weight for the left side is

$$g_j^{(\ell+1,f)} = 1 / \left(1 + e^{-\mathbf{w}_{gate}^{(\ell,f,j)} \bar{\mathbf{z}}_j^{(\ell)}}\right),$$

while for $f \in$ Time-Flow, the parameters for the corresponding gating network, denoted as $\mathbf{w}_{gate}^{(\ell,f)}$,
are shared. The gated feature map is then a weighted sum of the feature-maps from the two windows:

$$z_j^{(\ell+1,f)} = g_j^{(\ell+1,f)} z_{2j-1}^{(\ell,f)} + \left(1 - g_j^{(\ell+1,f)}\right) z_{2j}^{(\ell,f)}. \tag{5}$$


We find that this gating strategy works significantly better than direct pooling over feature-maps, and
also slightly better than a hard gate version of Equation (5).
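To make the soft gate concrete, here is a minimal sketch for one feature-map with window size 2: a logistic regressor on the merged input yields the weight of the left candidate, and the right candidate receives the remainder. The function names, the 0-based indexing, and the toy values are assumptions for illustration only.

```python
import math

def gate_pair(left, right, merged_input, w_gate):
    """Blend two neighbouring feature values with a logistic gate.

    Returns (gated_value, g), where g is the normalized weight of the left value.
    """
    g = 1.0 / (1.0 + math.exp(-sum(w * z for w, z in zip(w_gate, merged_input))))
    return g * left + (1.0 - g) * right, g

def gated_layer(feature_map, merged_inputs, w_gates):
    """Apply the gate to windows (0, 1), (2, 3), ... of one feature-map.

    With 0-based indexing, window j covers positions 2j and 2j + 1.
    For a Time-Arrow map, w_gates[j] differs per window; for a Time-Flow map,
    pass the same weight vector for every j.
    """
    out = []
    for j in range(len(feature_map) // 2):
        value, _ = gate_pair(feature_map[2 * j], feature_map[2 * j + 1],
                             merged_inputs[j], w_gates[j])
        out.append(value)
    return out

# Toy values (hypothetical): zero gate weights give g = 0.5, i.e. a plain average.
feature_map = [0.2, 0.8, 0.5, 0.1]
merged_inputs = [[1.0, 0.0], [0.0, 1.0]]
gated = gated_layer(feature_map, merged_inputs, [[0.0, 0.0], [0.0, 0.0]])
```

A hard gate would instead pick whichever side has the larger weight, and max-pooling would pick the larger feature value regardless of any gate.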

3.3 Recursive Architecture 

As suggested early on in Section 2 and Figure 1, we use extra CNNs with conventional weight-sharing,
named βCNN, to summarize the history out of the scope of αCNN. More specifically, the output of βCNN
(with the same dimension as the word-embedding) is put before the first word as the input to the αCNN, as
illustrated in Figure 4. Different from αCNN, βCNN is designed just to summarize the history, with
weights shared across its convolution units. In a sense, βCNN has only Time-Flow feature-maps.
All βCNNs are identical and recursively aligned, enabling genCNN to handle sentences of arbitrary
length. We put a special switch after each βCNN to turn it off (replacing its output with a padding vector,
shown as “/” in Figure 4) when there is no history assigned to it. As a result, when the history is shorter than
$L_\alpha$, the recursive structure reduces to αCNN.
Figure 4: genCNN with recursive structure.
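One way to picture the recursion is as a partition of the history: the most recent words go to αCNN and older words are cut into blocks, one per βCNN. The sketch below assumes exactly that partition, with the window sizes 30 and 20 taken from Section 4.1; how the slot for the βCNN output is counted against the αCNN window is not specified here, so this is only an illustration, and `split_history` is a hypothetical helper.

```python
def split_history(history, l_alpha=30, l_beta=20):
    """Return (beta_chunks, alpha_window) for a list of words.

    alpha_window: the most recent l_alpha words, handled by the alpha CNN.
    beta_chunks:  older words in blocks of at most l_beta, ordered oldest first,
                  one block per beta CNN (assumed partition, see lead-in).
    """
    if l_alpha <= 0 or l_beta <= 0:
        raise ValueError("window sizes must be positive")
    alpha_window = history[-l_alpha:]
    older = history[:-l_alpha]
    beta_chunks = []
    while older:
        # Peel off the block closest to the alpha window and place it in front,
        # so the final list runs from the oldest block to the newest.
        beta_chunks.insert(0, older[-l_beta:])
        older = older[:-l_beta]
    return beta_chunks, alpha_window

# A 75-word history needs three beta blocks; a 3-word history needs none.
chunks, window = split_history(list(range(75)))
short_chunks, short_window = split_history(["a", "b", "c"])
```

When `beta_chunks` is empty, every βCNN is switched off and only αCNN is active, matching the reduction described in the text.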

In practice, 90+% of sentences can be modeled by αCNN with $L_\alpha$ = 40 and 99+% of sentences can
be covered with one extra βCNN. Our experiment shows that this recursive strategy yields a better
estimate of the conditional density than neglecting the out-of-scope history (Section 6.1.2). In practice, we
found that a larger (greater $L_\alpha$) and deeper αCNN works better than a small αCNN with more recursion
of βCNN, which is consistent with our intuition that the bottom-up convolutional architecture is well
suited for modeling the sequence.


3.4 Analysis 

3.4.1 Time-Flow vs. Time-Arrow 

Both conceptually and systemically, genCNN gives two interwoven treatments of word history. With
the globally-shared parameters in the convolution units, Time-Flow summarizes what has been said.
The hierarchical convolution+gating architecture in Time-Flow enables it to model the composition
in language, yielding representations of segments at different intermediate layers. Time-Flow is
aware of the sequential direction, inherited from the space-awareness of CNN, but it is not sensitive
enough to the prediction task, due to the uniform weights in the convolution.

On the other hand, Time-Arrow, living in the location-dependent parameters of convolution units,
acts like an arrow pin-pointing the prediction task. Time-Arrow has predictive power all by itself,
but it concentrates on capturing the direction of time and is consequently weak at modelling
long-range dependency.

Time-Flow and Time-Arrow have to work together for optimal performance in predicting what
is going to be said. This intuition has been empirically verified: our experiments show
that Time-Flow or Time-Arrow alone performs worse. One can imagine that, through the
layer-by-layer convolution and gating, Time-Arrow gradually picks the most relevant part from the
representation of Time-Flow for the prediction task, even if that part is a long distance away.

3.4.2 genCNN vs. RNN-LM

Different from RNNs, which recursively apply relatively simple processing units, genCNN gains
its ability in sequence modeling mostly from its flexible and powerful bottom-up convolutional
architecture. genCNN takes the “uncompressed” history, and therefore avoids

• the difficulty in finding a representation for the history (i.e., unfinished sentences), especially those
that end in the middle of a chunk (e.g., “the cat sat on the”),

• the damping effect in RNNs when the history-summarizing hidden states are updated at each time step,
which renders long-term memory rather difficult.

Both drawbacks can only be partially ameliorated with a complicated design of gates (Hochreiter
and Schmidhuber, 1997) and/or heavier processing units (essentially a fully connected
DNN) (Sutskever et al., 2014).


4 genCNN: Training


The parameters of a genCNN, $\Theta$, consist of the parameters for the CNN $\Theta_{nn}$, the word-embedding
$\Theta_{embed}$, and the parameters for soft-max $\Theta_{softmax}$. All the parameters are jointly learned by
maximizing the likelihood of observed sentences. Formally the log-likelihood of sentence $\mathcal{S}_n$
($\overset{\text{def}}{=} [e_1^{(n)}, e_2^{(n)}, \cdots, e_{T_n}^{(n)}]$) is

$$\log p(\mathcal{S}_n; \Theta) = \sum_{t=1}^{T_n} \log p\left(e_t^{(n)} \mid e_{1:t-1}^{(n)}; \Theta\right),$$

which can be trivially split into $T_n$ training instances during the optimization, in contrast to the training
of RNNs, which requires unfolding through time due to the temporal dependency of the hidden states.
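The split into independent training instances can be sketched as follows. The model is abstracted as any callable returning a conditional probability; the `uniform` stand-in and the example sentence are hypothetical and only serve to make the sketch runnable.

```python
import math

def training_instances(sentence):
    """Split a sentence of T words into T (history, next_word) pairs."""
    return [(sentence[:t], sentence[t]) for t in range(len(sentence))]

def sentence_log_likelihood(sentence, cond_prob):
    """Sum log p(word | history) over all instances of one sentence.

    cond_prob(history, word) is any model returning a probability in (0, 1].
    """
    return sum(math.log(cond_prob(history, word))
               for history, word in training_instances(sentence))

def uniform(history, word):
    # Hypothetical stand-in model: uniform over a 4-word vocabulary.
    return 0.25

sentence = ["the", "cat", "sat", "<eos>"]
instances = training_instances(sentence)
ll = sentence_log_likelihood(sentence, uniform)
```

Because each pair depends only on its own history, the pairs can be shuffled into mini-batches independently of one another.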









4.1 Implementation Details 

Architectures: In all of our experiments (Sections 5 and 6) we set the maximum number of words for αCNN
to 30 and that for βCNN to 20. αCNN has two convolution layers (both containing Time-Flow
and Time-Arrow convolution) and two gating layers, followed by a fully connected layer
(400 dimensions) and then a soft-max layer. The numbers of feature-maps for Time-Flow are
respectively 150 (1st convolution layer) and 100 (2nd convolution layer), while Time-Arrow has the same
numbers of feature-maps. βCNN is relatively simple, with two convolution layers containing only Time-Flow
with 150 feature-maps, two gating layers and a fully connected layer. We use ReLU as the activation
function for convolution layers and switch to Sigmoid for fully connected layers. We use word
embeddings with dimension 100.


Soft-max: Calculating a full soft-max is expensive since it has to enumerate all the words in the
vocabulary (in our case 40K words) in the denominator. Here we take a simple hierarchical approximation
of it, following (Bahdanau et al., 2014). Basically we group the words into 200 clusters (indexed by
$c_m$), and factorize (in an approximate sense) the conditional probability of a word $p(e_t \mid e_{1:t-1}; \Theta)$ into
the probability of its cluster and the probability of $e_t$ given its cluster

$$p(c_m \mid e_{1:t-1}; \Theta)\, p(e_t \mid c_m; \Theta_{softmax}).$$

We found that this simple heuristic can speed up the optimization by 5 times with only a slight loss of
accuracy.
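The factorization can be sketched with two small soft-maxes, one over clusters and one over the words inside the chosen cluster. The cluster assignment, the scores, and the helper names below are hypothetical; how the scores are produced by the network is outside this sketch.

```python
import math

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def factored_word_prob(word, cluster_of, cluster_scores, word_scores):
    """Approximate p(word | history) as p(cluster | history) * p(word | cluster).

    cluster_of:     dict mapping each word to its cluster index.
    cluster_scores: one score per cluster (assumed to depend on the history).
    word_scores:    dict mapping each word to its within-cluster score.
    """
    c = cluster_of[word]
    p_cluster = softmax(cluster_scores)[c]
    # Normalize only over the words that share the cluster.
    members = sorted(w for w, cw in cluster_of.items() if cw == c)
    p_word = softmax([word_scores[w] for w in members])[members.index(word)]
    return p_cluster * p_word

# Toy setting (hypothetical): 4 words in 2 clusters.
cluster_of = {"a": 0, "b": 0, "c": 1, "d": 1}
cluster_scores = [0.0, 0.0]
word_scores = {"a": 0.0, "b": 0.0, "c": 1.0, "d": -1.0}
total = sum(factored_word_prob(w, cluster_of, cluster_scores, word_scores)
            for w in cluster_of)
```

With 40K words in 200 clusters, and assuming clusters of roughly equal size, each prediction normalizes over about 200 clusters plus about 200 words instead of 40K words.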


Optimization: We use stochastic gradient descent with mini-batches (size 500) for optimization, aided
further by AdaGrad (Duchi et al., 2011). For initialization, we use Word2Vec (Mikolov et al., 2013)
for the starting state of the word-embeddings (trained on the same dataset as the main task), and set all
the other parameters by randomly sampling from a uniform distribution in [−0.1, 0.1]. The optimization
is done mainly on a Tesla K40 GPU, and takes about 2 days for training on a dataset containing
1M sentences.
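For readers unfamiliar with AdaGrad, the following is a minimal sketch of the uniform initialization and of one AdaGrad update on a flat parameter list. The learning rate `lr`, the stabilizer `eps`, and the seed are assumptions, since the text does not report them, and the mini-batch gradient is assumed to be computed elsewhere.

```python
import math
import random

def init_uniform(n, low=-0.1, high=0.1, seed=0):
    """Sample n parameters uniformly from [low, high]."""
    rng = random.Random(seed)
    return [rng.uniform(low, high) for _ in range(n)]

def adagrad_step(params, grads, accum, lr=0.1, eps=1e-8):
    """In-place AdaGrad update.

    accum holds the running sum of squared gradients per parameter, so
    parameters with a history of large gradients take smaller steps.
    """
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)

initial = init_uniform(3)

# One update from zero parameters with a hypothetical mini-batch gradient.
params = [0.0, 0.0, 0.0]
accum = [0.0, 0.0, 0.0]
adagrad_step(params, [1.0, 0.0, -2.0], accum)
```

On the first step each non-zero gradient moves its parameter by roughly `lr`, regardless of the gradient's magnitude, which is the per-parameter scaling AdaGrad is known for.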


5 Experiments: Sentence Generation 

In this experiment, we randomly generate sentences by recurrently sampling

$$e_{t+1} \sim p(e_{t+1} \mid e_{1:t}; \Theta),$$

and putting the newly generated word into the history, until EOS (end-of-sentence) is generated. We consider
generating two types of sentences: 1) plain sentences, and 2) sentences with dependency parsing,
which will be covered respectively in Sections 5.1 and 5.2.
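The sampling procedure amounts to a short loop. In this sketch the model is any callable that maps a history to a word-probability dictionary; `toy_model`, the `max_len` safeguard, and the seed are assumptions added so the example terminates and is reproducible.

```python
import random

def generate(next_word_probs, prefix=(), eos="<eos>", max_len=50, seed=0):
    """Sample words one at a time until the end-of-sentence token is drawn.

    next_word_probs(history) returns a dict mapping words to probabilities.
    prefix lets a human start the sentence; max_len is a safety cap.
    """
    rng = random.Random(seed)
    history = list(prefix)
    while len(history) < max_len:
        probs = next_word_probs(history)
        words = list(probs)
        word = rng.choices(words, weights=[probs[w] for w in words], k=1)[0]
        if word == eos:
            break
        # The sampled word becomes part of the history for the next step.
        history.append(word)
    return history

def toy_model(history):
    # Hypothetical model: always ends the sentence once it has 3 words.
    if len(history) >= 3:
        return {"<eos>": 1.0}
    return {"a": 0.5, "b": 0.5}

sentence = generate(toy_model, prefix=["start"])
```

Passing a non-empty `prefix` corresponds to completing a sentence started by a human, and an empty one to generating from scratch.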

5.1 Natural Sentences 

We train genCNN on Wiki data with 112M words for one week, with some representative examples
randomly generated given in Table 1 (upper and middle blocks). We try two settings, by asking
genCNN to 1) finish a sentence started by a human (upper block), or 2) generate a sentence from the
beginning (middle block). It is fairly clear that most of the time genCNN can generate sentences that
are syntactically grammatical and semantically meaningful. More specifically, most of the sentences
can be aligned to a parse tree with reasonable structure. It is also worth noting that quotation marks (''
and '') are always generated in pairs and in the correct order, even across a relatively long distance,
as exemplified by the first sentence in the upper block.











'' we are in the building of china ' s social development and the businessmen audience , '' he said .
clinton was born in DDDD , and was educated at the university of edinburgh.
bush 's first album , '' the man '' , was released on DD november DDDD .
it is one of the first section of the act in which one is covered in real place that recorded in norway .

this objective is brought to us the welfare of our country
russian president putin delivered a speech to the sponsored by the 15th asia pacific economic cooperation ( apec ) meeting in an historical arena on oct .
light and snow came in kuwait and became operational , but was rarely placed in houston .
johnson became a drama company in the DDDDs , a television broadcasting company owned by the broadcasting program .

( ( the two * sides ) * should ( * assume ( a strong * target ) ) ) . )
( it * is time ( * in ( every * country ) * signed ( the * speech ) ) )
( ( initial * investigations ) * showed ( * that ( spot * could ( * be ( further * improved significantly ) ) )
( ( a * book ( to * northern ( the 21 st * century ) ) ) . )

Table 1: Examples of sentences generated by genCNN. In the upper block (rows 1-4) the underlined
words are given by the human. In the middle block (rows 5-8), all the sentences are generated
without any hint. The bottom block (rows 9-12) shows the sentences with dependency tags generated by
genCNN trained with parsed examples.


5.2 Sentences with Dependency Tags 

For training, we first parse (Klein and Manning, 2002) the English sentences and feed sequences with
dependency tags as follows

( I * like ( red * apple ) )

to genCNN, where 1) each pair of parentheses contains a subtree, and 2) the symbol * indicates
that the word next to it is the dependency head in the corresponding sub-tree. Some representative
examples generated by genCNN are given in Table 1 (bottom block). As they suggest, genCNN is
fairly accurate in respecting the rules of parentheses, and probably more remarkably, it gets the
dependency tree head correct most of the time.
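Whether a generated sequence respects the bracketing rules can be checked mechanically. The sketch below tests two conditions on a tokenized sequence: parentheses balance, and each subtree contains exactly one `*` marker directly inside it. The second condition is an assumed reading of the format (one head per subtree), and `check_bracketed` is a hypothetical helper, not part of the paper's evaluation.

```python
def check_bracketed(tokens):
    """Return True if parentheses balance and every subtree marks one head.

    tokens: the sequence split on whitespace, e.g. "( I * like )".split().
    """
    stack = []  # number of '*' markers seen directly inside each open subtree
    for tok in tokens:
        if tok == "(":
            stack.append(0)
        elif tok == ")":
            # Closing a subtree that was never opened, or that has no single
            # head marker, violates the format.
            if not stack or stack.pop() != 1:
                return False
        elif tok == "*":
            if not stack:
                return False
            stack[-1] += 1
    # Any subtree still open at the end is unbalanced.
    return not stack

ok = check_bracketed("( I * like ( red * apple ) )".split())
no_head = check_bracketed("( I like )".split())
unclosed = check_bracketed("( I * like".split())
```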


6 Experiments: Language Modeling 


We evaluate our model as a language model in terms of both perplexity (Brown et al., 1992) and
its efficacy in re-ranking the n-best candidates from state-of-the-art models in statistical machine
translation, both with comparison to the following competitor language models.

Competitor Models: We compare genCNN to the following competitor models:

5-gram: We use the SRI Language Modeling Toolkit (Stolcke and others, 2002) to train a 5-gram
language model with modified Kneser-Ney smoothing;

FFN-LM: The neural language model based on a feedforward network (Vaswani et al., 2013). We
vary the input window-size from 5 to 20, while the performance stops improving after window
size 20;



























RNN: we use the implementation¹ of the RNN-based language model with hidden size 600 for its
optimal performance;

LSTM: we use the code in GroundHog², but vary the hyper-parameters, including the depth and
word-embedding dimension, for best performance. LSTM (Hochreiter and Schmidhuber, 1997)
is widely considered to be the state of the art for sequence modeling.


6.1 Perplexity 

We test the performance of genCNN on Penn Treebank and FBIS, two public datasets with
different sizes.
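For reference, perplexity is conventionally the exponential of the average negative log-probability per predicted token. A minimal sketch of that standard definition follows, assuming natural logarithms and that every scored token (including EOS where it is counted) contributes one entry; the example probabilities are hypothetical.

```python
import math

def perplexity(log_probs):
    """Perplexity from per-token natural-log probabilities."""
    if not log_probs:
        raise ValueError("need at least one token")
    return math.exp(-sum(log_probs) / len(log_probs))

# A model that assigns probability 1/4 to every token has perplexity 4.
ppl = perplexity([math.log(0.25)] * 8)
```

Lower is better: a perplexity of 4 means the model is, on average, as uncertain as a uniform choice among 4 words.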


6.1.1 On Penn Treebank 

Although a relatively small dataset³, Penn Treebank is widely used as a language modelling
benchmark (Graves, 2013; Mikolov et al., 2010). It has 930,000 words in the training set, 74,000 words
in the validation set, and 82,000 words in the test set. We use exactly the same settings as in (Mikolov et al.,
2010), with a 10,000-word vocabulary (all out-of-vocabulary words are replaced with unknown)
and an end-of-sentence token (EOS). In addition to the conventional testing strategy where the models
are kept unchanged during testing, Mikolov et al. (2010) propose to also update the parameters in an
online fashion when seeing test sentences. This new way of testing, named “dynamic evaluation”, is
also adopted by Graves (2013).

From Table 2, genCNN manages to give superior perplexity in both metrics, with about a 25 point
reduction over the widely used 5-gram, and about a 10 point reduction from LSTM (9.6 points without
and 10.7 points with dynamic evaluation), the state of the art. We defer the comparison of genCNN
variants to the next experiment on a larger dataset (FBIS), since Penn Treebank is too small for
evaluating some of the differences between them.


Model          Perplexity   Dynamic
5-gram, KN5    141.2        -
FFNN-LM        140.2        -
RNN            124.7        123.2
LSTM           126          117
genCNN         116.4        106.3

Table 2: Penn Treebank results, where the 3rd column gives the perplexity in dynamic evaluation,
while the numbers for RNN and LSTM are taken as reported in the papers cited above. The numbers
in boldface indicate that the result is significantly better than all competitors in the same setting.


6.1.2 On FBIS 

The FBIS corpus (LDC2003E14) is relatively large, with 22.5K sentences and 8.6M English words. 
The validation set is NIST MT06 and test set is NIST MT08. For training the neural network, we limit 
the vocabulary to the most frequent 40,000 words, covering ~99.4% of the corpus. Similar to the first 
experiment, all out-of-vocabulary words are replaced with unknown and the EOS token is counted in 
the sequence loss. 

1 http://rnnlm.org/ 

2 https://github.com/lisa-groundhog/GroundHog 
3 http://www.fit.vutbr.cz/~imikolov/rnnlm/simple-examples.tgz





















Model               Perplexity
5-gram, KN5         278.6
FFN-LM(5-gram)      248.3
FFN-LM(20-gram)     228.2
RNN                 223.4
LSTM                206.9
genCNN              181.2

Time-Arrow only     192
Time-Flow only      203
αCNN only           184.4

Table 3: FBIS results. The upper block (rows 1-6) compares genCNN and the competitor models,
and the bottom block (rows 7-9) compares different variants of genCNN.


From Table 3 (upper block), genCNN clearly wins again in the comparison to competitors, with
over 25 point margin over LSTM (in its optimal setting), the second best performer. Interestingly,
genCNN also outperforms its variants quite significantly (bottom block): 1) with only Time-Arrow
(same number of feature-maps), the performance deteriorates considerably from losing the ability to
capture long range correlation reliably; 2) with only Time-Flow the performance gets even worse,
from partially losing the sensitivity to the prediction task. It is quite remarkable that, although αCNN
(with $L_\alpha$ = 30) can achieve good results, the recursive structure in full genCNN can further decrease
the perplexity by over 3 points, indicating that genCNN can benefit from modeling the dependency
over ranges as long as 30 words.


6.2 Re-ranking for Machine Translation 

In this experiment, we re-rank the 1000-best English translation candidates for Chinese sentences
generated by a statistical machine translation (SMT) system, and compare genCNN with other language
models in the same setting.
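Re-ranking an n-best list with a language model can be sketched as re-sorting candidates by a combined score. How the language-model score is combined with the SMT score is not spelled out here, so the linear combination, the `lm_weight` parameter, and the toy scoring function below are assumptions for illustration only.

```python
def rerank(candidates, lm_log_prob, lm_weight=1.0):
    """Sort n-best candidates by SMT score plus weighted LM log-probability.

    candidates:  list of (tokens, smt_score) pairs, higher score is better.
    lm_log_prob: callable returning the LM log-probability of a token list.
    Returns the candidates ordered best first.
    """
    return sorted(candidates,
                  key=lambda c: c[1] + lm_weight * lm_log_prob(c[0]),
                  reverse=True)

def toy_lm(tokens):
    # Hypothetical language model: each token costs one unit of log-probability.
    return -float(len(tokens))

nbest = [(["a", "b", "c"], -1.0), (["a", "b"], -1.5)]
reranked = rerank(nbest, toy_lm)
unchanged = rerank(nbest, toy_lm, lm_weight=0.0)
```

With `lm_weight=0.0` the original SMT ordering is recovered, which makes the contribution of the language model easy to isolate.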


SMT setup: The baseline hierarchical phrase-based SMT system (Chinese→English) was built using
Moses, a widely accepted state-of-the-art system, with default settings. The bilingual training data is from
the NIST MT2012 constrained track, reduced to 1.1M sentence pairs using the selection strategy
in (Axelrod et al., 2011). The baseline uses a conventional 5-gram language model (LM), estimated
with modified Kneser-Ney smoothing (Chen and Goodman, 1996) on the English side of the 329M-word
Xinhua portion of English Gigaword (LDC2011T07). We also try FFN-LM, as a much stronger
language model in decoding. The weights of all the features are tuned via MERT (Och and Ney,
2002) on NIST MT05, and tested on NIST MT06 and MT08. Case-insensitive NIST BLEU⁴ is used
in evaluation.

Re-ranking with genCNN significantly improves the quality of the final translation. Indeed, it
increases the BLEU score by 1.33 points over the Moses baseline on average. This boost
barely slackens on translation with an enhanced language model in decoding: the genCNN re-ranker still
achieves a 1.29 point improvement on top of Moses with FFN-LM, which is 1.76 points over Moses
(default setting). To see the significance of this improvement, note that the state-of-the-art Neural Network
Joint Model (Devlin et al., 2014) usually brings less than a one point increase on this task.


4 ftp://jaguar.ncsl.nist.gov/mt/resources/mteval-v11b.pl























Models           MT06    MT08    Ave.
Baseline         38.63   31.11   34.87
RNN rerank       39.03   31.50   35.26
LSTM rerank      39.20   31.90   35.55
FFN-LM rerank    38.93   31.41   35.14
genCNN rerank    39.90   32.50   36.20

Base+FFN-LM      39.08   31.60   35.34
genCNN rerank    40.4    32.85   36.63

Table 4: The results for re-ranking the 1000-best of Moses. Note that the two bottom rows are on a
baseline with enhanced LM.


7 Related Work 


In addition to the long thread of work on neural network based language models (Auli et al., 2013;
Mikolov et al., 2010; Graves, 2013; Bengio et al., 2003; Vaswani et al., 2013), our work is also related
to the effort on modeling long range dependency in word sequence prediction (Wu and Khudanpur,
2003). Different from that work on hand-crafting features for incorporating long range dependency,
our model can elegantly assimilate relevant information in a unified way, in both long and short
range, with the bottom-up information flow and convolutional architecture.

CNNs have been widely used in computer vision and speech (Lawrence et al., 1997; Krizhevsky et
al., 2012; LeCun and Bengio, 1995; Abdel-Hamid et al., 2012), and lately in sentence representation
(Kalchbrenner and Blunsom, 2013), matching (Hu et al., 2014) and classification (Kalchbrenner et
al., 2014). To the best of our knowledge, this is the first time a CNN is used in word sequence prediction.
Model-wise, the previous work that is closest to genCNN is the convolution model for predicting moves in the
Go game (Maddison et al., 2014), which, when applied recurrently, essentially generates a sequence.
Different from the conventional CNN taken in (Maddison et al., 2014), genCNN has architectures
designed for modeling the composition in natural language and the temporal structure of word sequences.


8 Conclusion 


We propose a convolutional architecture for natural language generation and modeling. Our extensive
experiments on sentence generation, perplexity, and n-best re-ranking for machine translation show
that our model can significantly improve upon the state of the art.